Linear Regression


Linear Regression and Advertising Data



Simple Linear Regression


Estimation of the Parameters by Least Squares



Assessing the Accuracy of the Coefficient Estimates


Bias and Unbiasedness

  • Bias

    • On the basis of one particular set of observations \(y_1, ..., y_n\), an estimate has the potential to overestimate the parameter of interest, while on the basis of another set of observations it has the potential to underestimate it

    • In contrast, unbiased estimators do not systematically over- or underestimate the true parameter

      • The property of unbiasedness holds for the least squares coefficient estimates (Best Linear Unbiased Estimate or BLUE)

        • If we estimate \(B_0\) and \(B_1\) on the basis of a particular data set, then our estimates won’t be exactly equal to \(B_0\) and \(B_1\). But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on
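
A minimal simulation sketch of this idea (simulated data rather than the Advertising set, with made-up true values \(B_0 = 2\) and \(B_1 = 3\)): each simulated data set yields a slightly different \(\hat{B_1}\), but the average over many data sets lands very close to the true \(B_1\).

# Simulate many data sets from a known model, fit least squares to each,
# and average the slope estimates to illustrate unbiasedness
set.seed(1)
B0 <- 2; B1 <- 3

slope_hats <- replicate(1000, {
  x <- runif(100)
  y <- B0 + B1 * x + rnorm(100, sd = 1)
  coef(lm(y ~ x))[2]   # estimate of B1 from this particular data set
})

mean(slope_hats)   # very close to the true B1 = 3
sd(slope_hats)     # but individual estimates vary around it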

Standard Errors and Confidence Intervals

  • One natural question concerns the accuracy of the sample mean \(\hat{\mu}\) as an estimate of \(\mu\)

  • We have established that the average of \(\hat{\mu}\)’s over many data sets will be very close to \(\mu\), but that a single estimate \(\hat{\mu}\) may be a substantial underestimate or overestimate of \(\mu\). How far off will that single estimate \(\hat{\mu}\) be?

  • Standard Error

    • The standard error of an estimator reflects how it varies under repeated sampling

    • The standard error of \(\hat{\mu}\) is written as \(SE(\hat{\mu})\)

    • \(Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\)

      • Notice how the standard error shrinks with \(n\): the more observations we have, the smaller the standard error

      • Here \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of \(Y\)

    • The standard error tells us the average amount that our estimate \(\hat{\mu}\) differs from the actual value of \(\mu\)

  • These standard errors can be used to compute confidence intervals.

  • Confidence Intervals

    • A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form: \(\hat{B_1} \pm 2 \cdot SE(\hat{B_1})\)

    • That is, there is approximately a 95% chance that the interval \([\hat{B_1} - 2 \cdot SE(\hat{B_1}), \hat{B_1} + 2 \cdot SE(\hat{B_1})]\) will contain the true value of \(B_1\) (see the R sketch after this list)

      • Under a scenario where we have repeated samples
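
A minimal sketch of these quantities in R, assuming the Advertising.csv file used later in these notes is in the working directory: fit the simple regression of sales on TV, read off \(SE(\hat{B_1})\), and form \(\hat{B_1} \pm 2 \cdot SE(\hat{B_1})\).

# Simple linear regression of sales on TV (Advertising data)
adv <- read.csv("Advertising.csv")
slr <- lm(sales ~ TV, data = adv)

b1_hat <- coef(summary(slr))["TV", "Estimate"]     # B1 hat
se_b1  <- coef(summary(slr))["TV", "Std. Error"]   # SE(B1 hat)

# Approximate 95% confidence interval: estimate +/- 2 * SE
c(lower = b1_hat - 2 * se_b1, upper = b1_hat + 2 * se_b1)

# Exact interval based on the t distribution, for comparison
confint(slr, "TV", level = 0.95)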

Hypothesis Testing

  • Standard errors can also be used to perform hypothesis tests on the coefficients

  • The most common hypothesis test involves testing the null hypothesis of

    • \(H_0\): There is no relationship between \(X\) and \(Y\)

    • \(H_A\): There is some relationship between \(X\) and \(Y\)

      • \(H_0\): \(B_1\) = 0

      • \(H_A\): \(B_1\) \(\neq\) 0

        • Since if \(B_1\) = 0, then the model reduces to \(Y = B_0 + \epsilon\), and \(X\) is not associated with \(Y\)

  • To test the null hypothesis, we compute a t-statistic given by: \(t = \frac{\hat{B_1}-0}{SE(\hat{B_1})}\)

    • This will have a \(t\)-distribution with n - 2 degrees of freedom (assuming \(B_1\) = 0)

    • Using statistical software (e.g., R), it is easy to compute the probability of observing any value equal to |\(t\)| or larger, assuming \(B_1 = 0\)

      • We call this probability the p-value

        • We interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response

        • Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response (the R sketch below recomputes the t-statistic and p-value by hand)
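
A short sketch of the same calculation in R, refitting the simple regression of sales on TV from the previous sketch: compute \(t = \hat{B_1} / SE(\hat{B_1})\) and the two-sided p-value from a \(t\)-distribution with \(n - 2\) degrees of freedom, then compare with the summary() output.

# t-test of H0: B1 = 0 for the simple regression of sales on TV
adv <- read.csv("Advertising.csv")
slr <- lm(sales ~ TV, data = adv)

b1_hat <- coef(summary(slr))["TV", "Estimate"]
se_b1  <- coef(summary(slr))["TV", "Std. Error"]

t_stat <- (b1_hat - 0) / se_b1
p_val  <- 2 * pt(abs(t_stat), df = nrow(adv) - 2, lower.tail = FALSE)

c(t = t_stat, p = p_val)                             # computed by hand
coef(summary(slr))["TV", c("t value", "Pr(>|t|)")]   # reported by summary()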


Assessing the Overall Accuracy of the Model


Multiple Linear Regression


Estimation and Prediction for Multiple Regression

library(tidyverse)

data <- read_csv("Advertising.csv")
MLR <- lm(sales ~ TV + radio + newspaper, data = data)
summary(MLR)
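
A short follow-up sketch of prediction from the fitted model; the budget values below are made up for illustration. interval = "confidence" gives an interval for the average sales at those budgets, while interval = "prediction" gives the wider interval for an individual market's sales.

# Hypothetical advertising budgets (same units as the data)
new_budgets <- data.frame(TV = 100, radio = 20, newspaper = 30)

# Interval for the average response at these predictor values
predict(MLR, newdata = new_budgets, interval = "confidence")

# Wider interval for an individual response at these predictor values
predict(MLR, newdata = new_budgets, interval = "prediction")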

Some Important Questions

1). Is at least one of the predictors \(X_1, X_2, ..., X_p\) useful in predicting the response? (The overall F-statistic reported by summary() addresses this; see the sketch after this list)


2). Do all the predictors help to explain \(Y\), or is it only a subset of the predictors that are useful?


3). How well does the model fit the data?


4). Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
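
For question 1, the overall F-statistic in the summary() output above tests \(H_0: B_1 = B_2 = ... = B_p = 0\) against the alternative that at least one coefficient is nonzero. A rough sketch of pulling it out of the fit, assuming the MLR object from the earlier chunk:

# Overall F-test: is at least one of TV, radio, newspaper useful?
fstat <- summary(MLR)$fstatistic   # F value plus its two degrees of freedom
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)   # p-value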


Other Considerations in the Regression Model



Extensions of the Linear Model


Removing the Additive Assumption

  • Interactions and Nonlinearity

    • Interactions:

      • In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media

        • For example, the linear model: \(sales = B_0 + B_1 \times TV + B_2 \times radio + B_3 \times newspaper + \epsilon\)

          • states that the average effect on sales of a one-unit increase in TV is always \(B_1\), regardless of the amount spent on radio

          • But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases

          • In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio

          • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect

interact <- lm(sales ~ TV*radio, data = data)
summary(interact)

  • The results in the summary output above suggest that interactions are important

  • The p-value for the interaction term \(TV \times radio\) is extremely low, indicating that there is strong evidence for \(H_A: B_3 \neq 0\), where \(B_3\) is the coefficient on \(TV \times radio\) in the interaction model \(sales = B_0 + B_1 \times TV + B_2 \times radio + B_3 \times (TV \times radio) + \epsilon\)

  • The \(R^2\) for the interaction model is 96.8% compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term

  • This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term

  • The coefficient estimates in the table suggest that an increase in TV advertising of 1,000 dollars is associated with increased sales of \((\hat{B_1}+\hat{B_3}\times radio) \times 1000 = 19 + 1.1 \times radio\) units

  • The coefficient estimates in the table suggest that an increase in radio advertising of 1,000 dollars is associated with increased sales of \((\hat{B_2}+\hat{B_3}\times TV) \times 1000 = 29 + 1.1 \times TV\) units
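
A short sketch of how these stated effects follow from the fitted coefficients, assuming the interact object from the chunk above: the per-$1,000 effect of TV on sales depends on the current radio budget through the interaction coefficient.

b <- coef(interact)   # (Intercept), TV, radio, TV:radio

# Increase in sales (in single units) associated with an extra $1,000 of
# TV advertising, evaluated at a few different radio budgets
radio_levels <- c(0, 25, 50)
(b["TV"] + b["TV:radio"] * radio_levels) * 1000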


Hierarchy

  • Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not

  • The hierarchy principle:

    • If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant

    • The rationale for this principle is that interactions are hard to interpret in a model without main effects — their meaning is changed

    • Specifically, when the model has no main effect terms, the interaction terms also contain main effects, which changes their interpretation
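
In R's formula syntax the hierarchy principle is easy to respect: TV*radio expands to the two main effects plus their interaction, whereas TV:radio alone fits only the interaction term. A quick sketch of the distinction, assuming the data object from the earlier chunk:

# Respects the hierarchy principle: main effects plus the interaction
hier_ok  <- lm(sales ~ TV * radio, data = data)   # TV + radio + TV:radio
coef(hier_ok)

# Interaction only, no main effects; generally avoided under the hierarchy principle
hier_bad <- lm(sales ~ TV:radio, data = data)
coef(hier_bad)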


Potential Problems

1). Non-linearity of the response-predictor relationships

  • Residual plots are a useful graphical tool for identifying non-linearity: a clear pattern (such as a U-shape) in the residuals versus fitted values suggests that the linear model does not capture the true relationship

MLR2 <- lm(sales ~ TV + radio + newspaper, data = data)
summary(MLR2)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(MLR2)



datamod <- data %>%
  mutate(modfitted = predict(MLR2),
         modresiduals = residuals(MLR2))

# Residuals vs fitted values for the additive model; the smooth curve helps
# reveal any remaining non-linear pattern in the residuals
ggplot(data = datamod, mapping = aes(x = modfitted, y = modresiduals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE) +
  scale_y_continuous(breaks = seq(-9, 6, by = 3), limits = c(-9, 6)) +
  theme_bw() +
  theme(plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'



2). Correlation of error terms


3). Non-constant variance of error terms


4). Outliers


5). High-leverage points

6). Collinearity


Comparison of Linear Regression with K-Nearest Neighbors


Parametric vs. Non-parametric

  • When will a parametric approach outperform a non-parametric approach?

    • A parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of \(f\)


  • The figure referenced here (not reproduced in these notes) provides an example of KNN regression fits on a data set with 50 observations

    • The true relationship is given by the black solid line

    • The blue curve corresponds to \(K = 1\) (on the left) and \(K = 9\) (on the right)

    • In this instance, the \(K=1\) predictions are far too variable, while the smoother \(K=9\) fit is much closer to \(f(X)\)

      • However, since the true relationship is linear, it is hard for a non-parametric approach to compete with linear regression: a non-parametric approach incurs a cost in variance that is not offset by a reduction in bias


  • As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor

    • Curse of Dimensionality

      • As the number of predictors increases relative to the number of observations, KNN’s performance deteriorates (a small simulation sketch comparing linear regression with KNN follows below)
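
A minimal simulation sketch of the comparison, using a hand-rolled one-dimensional KNN average (not a library implementation) and simulated data where the true \(f\) is linear; in this setting linear regression should generally achieve the lower test error, especially relative to \(K = 1\).

set.seed(1)

# Hand-rolled 1-D KNN regression: average the y's of the K nearest training x's
knn_reg <- function(x_train, y_train, x_test, k) {
  sapply(x_test, function(x0) {
    nearest <- order(abs(x_train - x0))[1:k]
    mean(y_train[nearest])
  })
}

# Simulated data with a truly linear relationship
n <- 50
x_train <- runif(n, -1, 1); y_train <- 2 + 3 * x_train + rnorm(n, sd = 0.5)
x_test  <- runif(200, -1, 1); y_test  <- 2 + 3 * x_test  + rnorm(200, sd = 0.5)

# Test MSE for linear regression vs KNN with K = 1 and K = 9
lin_fit  <- lm(y_train ~ x_train)
pred_lin <- coef(lin_fit)[1] + coef(lin_fit)[2] * x_test

mse <- function(pred) mean((y_test - pred)^2)
c(linear = mse(pred_lin),
  knn1   = mse(knn_reg(x_train, y_train, x_test, k = 1)),
  knn9   = mse(knn_reg(x_train, y_train, x_test, k = 9)))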